Exploring White Wine Dataset with R by Li Chang

As the final project for Udacity’s Data Analysis with R course, I decided to educate myself about white wine. I downloaded the dataset from here and loaded it into R. The research question is which physicochemical properties affect taste preference.

I - Univariate Plots Section

Below I examine each variable in the dataset. I started with a basic plot and then revised it to be clearer and more user-friendly. This helped me understand the distribution of each variable and decide whether anything needed tidying up.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
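The structure and summary listings above are the kind of output that `str()` and `summary()` print for a data frame. As a minimal sketch, the block below rebuilds only the first ten rows of four columns, copied from the `str()` listing, so that it runs without the data file. The object name `wine_head` and the commented-out file path are assumptions for illustration, not the names used in the original analysis.

```r
# First ten rows of four columns, copied from the str() listing above.
# With the real file one would read everything instead, for example:
#   wine <- read.csv("path/to/white-wine.csv")   # hypothetical path
wine_head <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28, 0.23, 0.23, 0.28, 0.32, 0.27, 0.3, 0.22),
  alcohol          = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality          = rep(6L, 10)   # stored as integer, as in the listing
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, mean and max per column
```

Because only ten rows are rebuilt here, the summary statistics will not match the full-dataset figures shown above.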

quality

In our dataset, the “quality” variable ranges between 3 and 9 with a median of 6, so there are neither very bad nor truly excellent wines, but mostly average ones. Only 25 wines are rated either 3 or 9, and from the bivariate section onward I excluded these 25 cases from the analysis. Though quality is stored as an integer, it makes more sense to convert it to an ordinal variable so I can compare the physicochemical properties across wine qualities.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
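The frequency table above, the conversion to an ordinal variable, and the exclusion of the 25 extreme cases can be sketched as follows. The quality vector is reconstructed from the counts in the table, so the block runs without the data file; the object names are illustrative assumptions, and the original analysis may have filtered the rows differently.

```r
# Rebuild the quality column from the counts shown in the table above.
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)
table(quality)

# Keep only ratings 4 to 8 and store them as an ordered factor.
keep      <- quality %in% 4:8
n_dropped <- sum(!keep)                      # wines rated 3 or 9
quality_f <- factor(quality[keep], levels = 4:8, ordered = TRUE)
table(quality_f)
```

Dropping 25 of 4898 rows leaves 4873 wines, which is consistent with the 4870 residual degrees of freedom reported for the three-parameter density regression later in the report.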

fixed acidity

The basic histogram shows that fixed acidity has very few values at the low end (the minimum is 3.8) and a long tail above 10, so I limited the x-axis range. Changing the bin width also shows more clearly that the majority of fixed acidity values fall between 5.5 and 8.5.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
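The two adjustments described for these histograms, a narrower bin width and a restricted x-axis, can be sketched in base R. The warning above suggests the original plots were drawn with ggplot2, so this is a swapped-in base R equivalent rather than the author's code. The data are simulated stand-ins, not the wine measurements, and the chosen mean, standard deviation and bin width are assumptions.

```r
set.seed(1)
# Simulated stand-in for fixed.acidity (NOT the real data).
fixed_acidity <- round(rnorm(1000, mean = 6.9, sd = 0.85), 1)

# Explicit breaks give a constant bin width of 0.2.
breaks <- seq(floor(min(fixed_acidity)), ceiling(max(fixed_acidity)), by = 0.2)

# xlim trims the sparse extremes from view without discarding any rows.
h <- hist(fixed_acidity, breaks = breaks, xlim = c(4, 11),
          main = "Fixed acidity", xlab = "fixed acidity")

# Share of values inside the 5.5 to 8.5 band mentioned in the text.
in_range <- mean(fixed_acidity >= 5.5 & fixed_acidity <= 8.5)
in_range
```

Restricting the axis only changes what is displayed; the counts in `h` still cover every observation.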

volatile acidity

After adjusting the bin width, I can see that most wines have a volatile acidity (acetic acid) level between 0.15 and 0.4 g/l, with a median of 0.26 g/l. I know that a high level of acetic acid causes an unpleasant vinegar taste and therefore a poor sensory rating. I can test this in the next section.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

citric acid

The majority of citric acid levels fall between 0.15 and 0.5 g/l, with a spike at 0.49 g/l. In contrast to volatile acidity, citric acid adds freshness to the wine.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

residual sugar

Residual sugar has a wide range, from 0.6 to 65.8 g/l, while the median is only 5.2 g/l. This is presumably because wine producers cater to consumers’ varying preferences for sweetness. Some people, like me, favor sweet wines, while others prefer bone dry.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

chlorides

Most wines have between 0.025 and 0.06 g/l of sodium chloride, with a median of 0.043 g/l. The highest level in this dataset is 0.346 g/l.

free sulfur dioxide

The median value of free sulfur dioxide is 34 mg/l. It has a wide range, from 2 to 289 mg/l, with the majority of values falling between 10 and 55 mg/l. Since free sulfur dioxide becomes noticeable at 50 mg/l, I assume it will affect the taste.

total sulfur dioxide

Similar to free sulfur dioxide, total sulfur dioxide also has a wide range from 9 to 440 mg/l with a median value at 134 mg/l.

density

Density has a small range, from 0.99 to 1.04 g/cc. It mostly depends on the percentage of alcohol and sugar in the wine.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

pH

pH has a small range, from 2.7 to 3.8, which means every wine in the dataset is clearly acidic.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

alcohol

An appropriate level of alcohol enhances the flavor, but a high level causes an unpleasant burning sensation. Our white wine dataset doesn’t appear to contain very high alcohol levels: the median is 10.4% and the majority of values fall between 9% and 13%.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Univariate Analysis

The dataset has 4898 observations of 12 variables, plus one column holding the wine ID (‘X’). All variables except wine quality are continuous numeric measurements; quality is an integer score.

  1. The majority of white wines in this dataset are rated between 4 and 8 on a scale of 10, with a median of 6. We don’t have very bad or truly excellent wines. Quality is the main feature I am interested in, and I wonder which physicochemical properties influence the taste preference. However, there are only 20 wines rated 3 and 5 wines rated 9. I think it’s better to drop these few cases before the bivariate and multivariate analysis so that we can focus on the bulk of the data. Other than that, the dataset is relatively clean.

  2. Residual sugar appears to have a wide range, from 0.6 to 65.8 g/l, supposedly to accommodate customers’ varying palates for sweetness.

  3. Free sulfur dioxide ranges from 2 to 289 mg/l, but be aware that the smell becomes noticeable at levels above 50 mg/l.

  4. White wine is highly acidic, with pH ranging from 2.7 to 3.8.

  5. Below I used recursive feature elimination (RFE) to explore which features are important. RFE repeatedly fits a model, ranks the features by importance, drops the weakest ones, and refits the model on the remaining features, which gives a ranking of how much each feature contributes. It is more often used in the model-building process, but I think it is valuable here to display the importance of each feature. Here is a good tutorial. The model I used for the feature selection is random forest, a method that aims to overcome the overfitting problem of single decision trees. The graph below shows that the R-squared value keeps rising as more variables are added, but the rate of increase slows down after three factors: volatile acidity, alcohol, and free sulfur dioxide. Alcohol is no doubt a very important feature, as no one would prefer bland wines. Both volatile acidity and free sulfur dioxide have such “volatile” properties that an excessive amount causes an unpleasant smell or taste. So I decided to use volatile acidity, alcohol, and free sulfur dioxide as the main features of interest for further investigation.

## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables   RMSE Rsquared  RMSESD RsquaredSD Selected
##          2 0.7535   0.2773 0.02850    0.07441         
##          3 0.6727   0.4235 0.01978    0.01431         
##          4 0.6464   0.4718 0.02131    0.01959         
##          5 0.6325   0.4990 0.02469    0.02162         
##          6 0.6234   0.5083 0.02587    0.02594         
##          7 0.6187   0.5182 0.02658    0.02776         
##          8 0.6152   0.5254 0.02979    0.03305         
##          9 0.6109   0.5309 0.02664    0.02828         
##         10 0.6056   0.5410 0.02676    0.02970         
##         11 0.6031   0.5450 0.02662    0.02952        *
## 
## The top 5 variables (out of 11):
##    volatile.acidity, alcohol, free.sulfur.dioxide, pH, residual.sugar
##  [1] "volatile.acidity"     "alcohol"              "free.sulfur.dioxide" 
##  [4] "pH"                   "residual.sugar"       "citric.acid"         
##  [7] "chlorides"            "fixed.acidity"        "sulphates"           
## [10] "total.sulfur.dioxide" "density"
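The output above has the shape of the `rfe()` print method from the caret package with 5-fold cross-validation. A minimal sketch of such a call is shown below, using caret's built-in random forest helper set `rfFuncs`. The wrapper name `run_rfe`, the toy data and the subset sizes are assumptions for illustration; the author's actual call and data preparation are not shown in the report.

```r
# Hypothetical wrapper around caret::rfe with random forest helper functions.
run_rfe <- function(x, y, sizes, folds = 5) {
  ctrl <- caret::rfeControl(functions = caret::rfFuncs,
                            method = "cv", number = folds)
  caret::rfe(x = x, y = y, sizes = sizes, rfeControl = ctrl)
}

# Toy data (NOT the wine data): y depends on a and b only.
set.seed(42)
n     <- 200
toy   <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
toy_y <- 2 * toy$a - toy$b + rnorm(n, sd = 0.5)

# Run only if both packages are installed.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  fit <- run_rfe(toy, toy_y, sizes = 2:3)
  print(fit)                     # RMSE and Rsquared per subset size
  print(caret::predictors(fit))  # selected features, most important first
}
```

For the wine data, `x` would be the 11 physicochemical columns, `y` the numeric quality score, and `sizes` would run from 2 to 11 to match the table above.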

  1. Quality is converted from integer to factor. I also used the 50 mg/l threshold to create a new variable, “free.sulfur.dioxide.cat”, with the levels “noticeable” (free.sulfur.dioxide > 50) and “not noticeable” (free.sulfur.dioxide <= 50).

  2. Looking through the summary data above, residual.sugar, free.sulfur.dioxide, and total.sulfur.dioxide appear to have wide ranges. In particular, wines with residual sugar above 45 g/l are considered very sweet. However, I don’t think the wide distributions should be adjusted, as some wines may legitimately display extreme characteristics.
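The two-level sulfur category described in item 1 can be derived with a simple threshold. In this sketch the first four values come from the `str()` listing, 50 and 51 are added to illustrate the boundary, and 289 is the reported maximum. The level labels follow the group headers printed later in the report, while the vector names are assumptions.

```r
# Free sulfur dioxide in mg/l; 50 and 51 are illustrative boundary cases.
free_so2 <- c(45, 14, 30, 47, 50, 51, 289)

# Values strictly above 50 count as "noticeable"; exactly 50 does not.
free_so2_cat <- factor(
  ifelse(free_so2 > 50, "> 50mg/l, noticeable", "<= 50mg/l, not noticeable"),
  levels = c("<= 50mg/l, not noticeable", "> 50mg/l, noticeable")
)

table(free_so2_cat)
```

Fixing the level order explicitly keeps "not noticeable" first in every table and plot, regardless of which value happens to appear first in the data.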

II - Bivariate Plots and Analysis

Below I used a correlation matrix to visualize the pairwise correlations between variables. Quality and alcohol appear to have a moderate positive correlation (0.44). In sharp contrast to the feature importance generated by RFE in Section I, volatile acidity and quality exhibit a very weak negative relationship (corr = -0.19). Between free sulfur dioxide and quality there is barely any correlation (corr = 0.02). It’s possible that the relationship between two variables is masked in the presence of other factors.

Main Features of Interest

Interestingly, the relationship between alcohol and rating doesn’t seem to be linearly positive. This leads me to think that other factors must be causing this parabola-like curve between alcohol and quality rating. In any case, for good wines (quality rating above 5), the higher the alcohol level, the better the rating. For example, the median alcohol level among wines rated 8 is 12%, about a quarter more than the 9.5% among wines rated 5.

## group: 3
## NULL
## -------------------------------------------------------- 
## group: 4
##   vars   n  mean sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 163 10.15  1   10.1   10.08 1.04 8.4 13.5   5.1  0.7     0.15 0.08
## -------------------------------------------------------- 
## group: 5
##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 1457 9.81 0.85    9.5    9.71 0.74   8 13.6   5.6 1.08     1.07
##     se
## 1 0.02
## -------------------------------------------------------- 
## group: 6
##   vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## 1    1 2198 10.58 1.15   10.5   10.52 1.33 8.5  14   5.5  0.4    -0.72
##     se
## 1 0.02
## -------------------------------------------------------- 
## group: 7
##   vars   n  mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 880 11.37 1.25   11.4   11.42 1.33 8.6 14.2   5.6 -0.3    -0.56
##     se
## 1 0.04
## -------------------------------------------------------- 
## group: 8
##   vars   n  mean   sd median trimmed  mad min max range  skew kurtosis  se
## 1    1 175 11.64 1.28     12   11.78 1.19 8.5  14   5.5 -0.89     0.01 0.1
## -------------------------------------------------------- 
## group: 9
## NULL
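The "about a quarter more" comparison can be checked directly from the medians in the grouped summaries above. The second half of the block is a sketch of how per-group medians could be computed in base R with `tapply()`; the listings above look like output from a grouped describe function, so `tapply()` is a swapped-in alternative, and the toy rows are made up.

```r
# Median alcohol by quality rating, copied from the grouped summaries above.
median_alcohol <- c("4" = 10.1, "5" = 9.5, "6" = 10.5, "7" = 11.4, "8" = 12)

# Relative increase from rating 5 to rating 8.
rel_increase <- unname(median_alcohol["8"] / median_alcohol["5"] - 1)
round(100 * rel_increase, 1)   # in percent

# Step-by-step changes: only the first step (4 to 5) goes down.
diff(median_alcohol)

# Toy (made-up) rows showing how group medians could be computed.
toy <- data.frame(quality = c(5, 5, 5, 8, 8, 8),
                  alcohol = c(9.2, 9.5, 10.0, 11.5, 12.0, 12.4))
toy_medians <- tapply(toy$alcohol, toy$quality, median)
toy_medians
```

The single negative step between ratings 4 and 5 is what gives the relationship its curved, non-linear look.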

Alcohol level reinforces acidity, but the chart below shows a non-linear relationship between the two. This may help explain why the correlation matrix above showed only a small correlation of -0.19. I then divided alcohol level into five groups with equal intervals. The median volatile acidity among wines with a high level of alcohol (13-14.2%) is 0.35, which is noticeably higher than in the 10.5-11.7% alcohol group (median = 0.24).

## (7.99,9.24] (9.24,10.5] (10.5,11.7]   (11.7,13]   (13,14.2] 
##         841        1723        1382         789         138
## group: (7.99,9.24]
##   vars   n mean  sd median trimmed  mad min  max range skew kurtosis se
## 1    1 841 0.28 0.1   0.27    0.27 0.07 0.1 0.82  0.71 1.46     3.25  0
## -------------------------------------------------------- 
## group: (9.24,10.5]
##   vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## 1    1 1723 0.28 0.1   0.26    0.27 0.09 0.08   1  0.92  1.7     5.88  0
## -------------------------------------------------------- 
## group: (10.5,11.7]
##   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 1382 0.26 0.09   0.24    0.25 0.07 0.09 0.96  0.88 1.79     6.87  0
## -------------------------------------------------------- 
## group: (11.7,13]
##   vars   n mean  sd median trimmed  mad  min max range skew kurtosis se
## 1    1 789  0.3 0.1   0.29    0.29 0.07 0.08 1.1  1.02 1.45     6.36  0
## -------------------------------------------------------- 
## group: (13,14.2]
##   vars   n mean   sd median trimmed mad  min  max range skew kurtosis   se
## 1    1 138 0.37 0.12   0.35    0.36 0.1 0.15 0.78  0.64 0.87     1.18 0.01
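The interval labels above match what base R's `cut()` produces when asked for five equal-width bins over the alcohol range of 8 to 14.2. The sketch below uses only the five summary values for alcohol (minimum, quartiles and maximum) rather than the full column, so the group counts differ from the table above, but the labels are the same. The variable names are assumptions.

```r
# Minimum, quartiles and maximum of alcohol, taken from the summary output.
alcohol <- c(8.0, 9.5, 10.4, 11.4, 14.2)

# breaks = 5 asks for five equal-width intervals spanning the observed range.
alcohol_group <- cut(alcohol, breaks = 5)
levels(alcohol_group)
table(alcohol_group)

# The three-group version used later in the multivariate section.
alcohol_group3 <- cut(alcohol, breaks = 3)
levels(alcohol_group3)
```

`cut()` widens the outermost breaks slightly so that the minimum and maximum fall inside an interval, which is why the first label starts at 7.99 rather than 8.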

Free sulfur dioxide decreases when alcohol level rises. For example, among wines with alcohol level at 7.99-9.24%, the median value of free sulfur dioxide is 42 mg/l, 50% higher than the median free sulfur dioxide (28 mg/l) found in the alcohol group between 13-14.2%.

## group: (7.99,9.24]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 841 41.54 15.77     42   41.37 16.31   5 128   123 0.32     0.94
##     se
## 1 0.54
## -------------------------------------------------------- 
## group: (9.24,10.5]
##   vars    n  mean    sd median trimmed   mad min   max range skew kurtosis
## 1    1 1723 37.55 17.98     36   36.75 19.27   3 138.5 135.5 0.57     0.64
##     se
## 1 0.43
## -------------------------------------------------------- 
## group: (10.5,11.7]
##   vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 1382 31.97 15.26     31   31.06 14.83   2 131   129 0.96     2.67
##     se
## 1 0.41
## -------------------------------------------------------- 
## group: (11.7,13]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 789 30.47 12.71     30      30 11.86   3  96    93 0.77     2.52
##     se
## 1 0.45
## -------------------------------------------------------- 
## group: (13,14.2]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 138 27.88 12.14     28   27.35 13.34   3  65    62  0.4    -0.07
##     se
## 1 1.03

Although the feature selection placed free sulfur dioxide and volatile acidity among the top three most important features, charting them against quality rating doesn’t show a strong relationship; it could be masked by other factors.

Other Interesting Features

Density depends on the proportion of alcohol in the wine, since alcohol is less dense than water. The plot below clearly shows that density decreases as the amount of alcohol increases. Additionally, density increases with sugar content. Together, residual sugar and alcohol explain more than 90% of the variance in density, so density shouldn’t be included alongside alcohol or sugar in any model, to minimize multicollinearity.

## 
## Call:
## lm(formula = density ~ alcohol + residual.sugar, data = wine)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0020165 -0.0005822 -0.0001410  0.0004685  0.0249803 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.005e+00  1.352e-04  7430.0   <2e-16 ***
## alcohol        -1.225e-03  1.191e-05  -102.9   <2e-16 ***
## residual.sugar  3.607e-04  2.888e-06   124.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0009121 on 4870 degrees of freedom
## Multiple R-squared:  0.907,  Adjusted R-squared:  0.907 
## F-statistic: 2.375e+04 on 2 and 4870 DF,  p-value: < 2.2e-16
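To make the regression output more concrete, the sketch below plugs the median alcohol (10.4%) and median residual sugar (5.2 g/l) into the reported coefficients, then fits the same formula to simulated data. The simulated data are generated from the reported coefficients and residual standard error, so they are a stand-in and not the wine measurements; the uniform ranges and sample size are assumptions.

```r
# Hand calculation with the coefficients reported above.
by_hand <- 1.005 - 1.225e-03 * 10.4 + 3.607e-04 * 5.2
by_hand   # close to the median density of 0.9937 in the summary table

# Simulated stand-in data built from the same coefficients (NOT real data).
set.seed(7)
n   <- 500
sim <- data.frame(alcohol        = runif(n, 8, 14.2),
                  residual.sugar = runif(n, 0.6, 20))
sim$density <- 1.005 - 1.225e-03 * sim$alcohol +
  3.607e-04 * sim$residual.sugar + rnorm(n, sd = 0.0009121)

# Same model formula as in the output above.
fit <- lm(density ~ alcohol + residual.sugar, data = sim)
coef(fit)
summary(fit)$r.squared
```

The negative alcohol coefficient and positive sugar coefficient encode the two statements in the text: more alcohol lowers density, more sugar raises it.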

It’s said that once free sulfur dioxide exceeds 50 mg/l, you can smell sulfur in the wine. There is an inflection point when free sulfur dioxide reaches 50 mg/l: the correlation between the two is 0.51 among wines with less than 50 mg/l of free sulfur dioxide but drops to 0.19 among wines with higher free sulfur dioxide.

## [1] 0.5062314
## [1] 0.1907176

III - Multivariate Plots and Analysis

Within the same quality group, a wine without a noticeable sulfur smell is likely to have a higher alcohol level. For example, among wines that are rated 6 and don’t have a noticeable sulfur smell, the median alcohol by volume is 10.6%, compared to 9.6% among wines with the same rating but a noticeable sulfur smell. In other words, keeping alcohol constant, you will likely get a better-rated wine if the sulfur level is unnoticeable. Also, for people who are concerned about health issues caused by sulfur, choosing a wine with a higher concentration of alcohol would likely reduce the intake of sulfur (if alcohol is less of a concern to them).

## : <= 50mg/l, not noticeable
## : 3
## NULL
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 3
## NULL
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 4
##   vars   n  mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 149 10.23 0.99   10.1   10.15 1.04 8.4 13.5   5.1 0.69     0.17
##     se
## 1 0.08
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 4
##   vars  n mean  sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 14 9.36 0.8   9.15    9.24 0.74 8.6 11.5   2.9 1.24     1.01 0.21
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 5
##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 1108 9.91 0.87    9.7    9.82 0.74 8.4 13.6   5.2 0.92     0.57
##     se
## 1 0.03
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 5
##   vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 349 9.49 0.68    9.4     9.4 0.44   8 13.5   5.5 1.79     5.28 0.04
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 6
##   vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## 1    1 1808 10.72 1.15   10.6   10.68 1.19 8.5  14   5.5 0.27     -0.8
##     se
## 1 0.03
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 6
##   vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 390  9.9 0.87    9.6    9.81 0.82 8.5 13.1   4.6 1.02     0.93 0.04
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 7
##   vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis
## 1    1 795 11.43 1.25   11.5   11.49 1.33 8.6 14.2   5.6 -0.37     -0.5
##     se
## 1 0.04
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 7
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 85 10.82 1.05     11   10.81 0.89   9 13.3   4.3 0.04    -0.68 0.11
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 8
##   vars   n  mean   sd median trimmed  mad min max range  skew kurtosis  se
## 1    1 151 11.69 1.24     12   11.83 1.19 8.5  14   5.5 -0.87     0.13 0.1
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 8
##   vars  n mean   sd median trimmed  mad min  max range  skew kurtosis  se
## 1    1 24 11.3 1.49   11.8   11.38 1.04 8.9 12.9     4 -0.78    -1.09 0.3
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 9
## NULL
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 9
## NULL

When I break down quality by alcohol level and volatile acidity, the negative relationship between volatile acidity and quality is strongest in the 7.99-10.1% alcohol group. For example, within this group the median volatile acidity decreases from 0.34 g/l for less desirable wines (quality = 4) to 0.19 g/l for highly rated ones (quality = 8); the former also show a higher variation in volatile acidity (sd = 0.15) than the latter (sd = 0.03).

## (7.99,10.1] (10.1,12.1] (12.1,14.2] 
##        2078        2143         652

## : (7.99,10.1]
## : 3
## NULL
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 3
## NULL
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 3
## NULL
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 4
##   vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 81 0.38 0.15   0.34    0.37 0.13 0.11 0.91   0.8 0.91     0.48 0.02
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 4
##   vars  n mean   sd median trimmed  mad  min max range skew kurtosis   se
## 1    1 75 0.37 0.18   0.32    0.35 0.12 0.16   1  0.84 1.41     1.81 0.02
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 4
##   vars n mean   sd median trimmed  mad  min max range skew kurtosis   se
## 1    1 7 0.51 0.31   0.32    0.51 0.12 0.24 1.1  0.86 0.81    -0.93 0.12
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 5
##   vars    n mean  sd median trimmed  mad min max range skew kurtosis se
## 1    1 1006 0.31 0.1    0.3     0.3 0.07 0.1 0.9   0.8 1.32     3.52  0
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 5
##   vars   n mean  sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 431 0.28 0.1   0.27    0.27 0.07 0.13 0.85  0.72 1.77     5.33  0
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 5
##   vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 20 0.36 0.13   0.31    0.34 0.08 0.18 0.67  0.49 0.93     -0.3 0.03
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 6
##   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 833 0.26 0.08   0.24    0.25 0.07 0.08 0.68   0.6 1.22     2.98  0
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 6
##   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 1084 0.25 0.08   0.24    0.24 0.07 0.08 0.96  0.88 1.82     7.74  0
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 6
##   vars   n mean  sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 281 0.31 0.1   0.29     0.3 0.07 0.11 0.78  0.68 1.28     2.68 0.01
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 7
##   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 136 0.21 0.06   0.19     0.2 0.04 0.11 0.44  0.33 1.18     2.14  0
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 7
##   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 477 0.24 0.08   0.23    0.24 0.07 0.08 0.52  0.44 0.74     0.54  0
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 7
##   vars   n mean   sd median trimmed  mad  min  max range skew kurtosis
## 1    1 267 0.33 0.09   0.32    0.32 0.07 0.12 0.76  0.64 0.68     1.64
##     se
## 1 0.01
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 8
##   vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis
## 1    1 22 0.18 0.03   0.19    0.19 0.01 0.12 0.26  0.14 -0.54     0.93
##     se
## 1 0.01
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 8
##   vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 76 0.25 0.09   0.24    0.24 0.09 0.12 0.47  0.35 0.81    -0.08 0.01
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 8
##   vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 77 0.33 0.11   0.31    0.32 0.09 0.12 0.66  0.54 0.75     0.56 0.01
## -------------------------------------------------------- 
## : (7.99,10.1]
## : 9
## NULL
## -------------------------------------------------------- 
## : (10.1,12.1]
## : 9
## NULL
## -------------------------------------------------------- 
## : (12.1,14.2]
## : 9
## NULL

If 50 mg/l is the level at which the unpleasant smell becomes noticeable, the negative relationship between wine quality and total sulfur dioxide should be stronger among wines with free sulfur dioxide above 50 mg/l than among those below it. The correlations bear this out (-0.26 above 50 mg/l versus -0.13 below), as the boxplot below shows.

## [1] -0.1265438
## [1] -0.2585947
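Comparing a correlation on either side of a threshold, as done above, amounts to computing `cor()` on two subsets. The helper below is a hypothetical sketch of that idea; the function name and the toy data are assumptions, and the toy numbers are not meant to reproduce the correlations reported above.

```r
# Hypothetical helper: correlation of x and y on each side of a threshold.
cor_by_threshold <- function(x, y, split_var, threshold) {
  below <- split_var <= threshold
  c(at_or_below = cor(x[below], y[below]),
    above       = cor(x[!below], y[!below]))
}

# Toy data (NOT the wine data), only to show the call.
set.seed(3)
n         <- 400
free_so2  <- runif(n, 2, 100)
total_so2 <- 100 + free_so2 + rnorm(n, sd = 20)
quality   <- 6 - 0.01 * (total_so2 - 100) + rnorm(n, sd = 0.8)

res <- cor_by_threshold(quality, total_so2, split_var = free_so2, threshold = 50)
res
```

With the wine data, `x` would be the numeric quality score, `y` total sulfur dioxide, and `split_var` free sulfur dioxide with a threshold of 50.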

IV - Final Plots and Summary

Plot One - Quality vs. Alcohol

## group: 3
## NULL
## -------------------------------------------------------- 
## group: 4
##   vars   n  mean sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 163 10.15  1   10.1   10.08 1.04 8.4 13.5   5.1  0.7     0.15 0.08
## -------------------------------------------------------- 
## group: 5
##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 1457 9.81 0.85    9.5    9.71 0.74   8 13.6   5.6 1.08     1.07
##     se
## 1 0.02
## -------------------------------------------------------- 
## group: 6
##   vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## 1    1 2198 10.58 1.15   10.5   10.52 1.33 8.5  14   5.5  0.4    -0.72
##     se
## 1 0.02
## -------------------------------------------------------- 
## group: 7
##   vars   n  mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 880 11.37 1.25   11.4   11.42 1.33 8.6 14.2   5.6 -0.3    -0.56
##     se
## 1 0.04
## -------------------------------------------------------- 
## group: 8
##   vars   n  mean   sd median trimmed  mad min max range  skew kurtosis  se
## 1    1 175 11.64 1.28     12   11.78 1.19 8.5  14   5.5 -0.89     0.01 0.1
## -------------------------------------------------------- 
## group: 9
## NULL

Description One

I chose this graph as one of the final plots because, among all the features, alcohol level shows the strongest correlation with wine rating (corr = 0.44). Alcohol level has a relatively small range, from 8% to 14.2%, with a median of 10.4%. The majority of our wine ratings fall between 5 and 7. Except for the rating 4 category, probably due to its relatively small sample size, a better-rated wine has a higher alcohol level (the left chart). For example, the median alcohol level among wines rated 8 on a scale of 0-10 is 12%, much higher than the 9.5% among wines rated 5.

Plot Two - Quality vs. Alcohol and Free Sulfur Dioxide

## : <= 50mg/l, not noticeable
## : 3
## NULL
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 3
## NULL
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 4
##   vars   n  mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 149 10.23 0.99   10.1   10.15 1.04 8.4 13.5   5.1 0.69     0.17
##     se
## 1 0.08
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 4
##   vars  n mean  sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 14 9.36 0.8   9.15    9.24 0.74 8.6 11.5   2.9 1.24     1.01 0.21
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 5
##   vars    n mean   sd median trimmed  mad min  max range skew kurtosis
## 1    1 1108 9.91 0.87    9.7    9.82 0.74 8.4 13.6   5.2 0.92     0.57
##     se
## 1 0.03
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 5
##   vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 349 9.49 0.68    9.4     9.4 0.44   8 13.5   5.5 1.79     5.28 0.04
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 6
##   vars    n  mean   sd median trimmed  mad min max range skew kurtosis
## 1    1 1808 10.72 1.15   10.6   10.68 1.19 8.5  14   5.5 0.27     -0.8
##     se
## 1 0.03
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 6
##   vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 390  9.9 0.87    9.6    9.81 0.82 8.5 13.1   4.6 1.02     0.93 0.04
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 7
##   vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis
## 1    1 795 11.43 1.25   11.5   11.49 1.33 8.6 14.2   5.6 -0.37     -0.5
##     se
## 1 0.04
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 7
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 85 10.82 1.05     11   10.81 0.89   9 13.3   4.3 0.04    -0.68 0.11
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 8
##   vars   n  mean   sd median trimmed  mad min max range  skew kurtosis  se
## 1    1 151 11.69 1.24     12   11.83 1.19 8.5  14   5.5 -0.87     0.13 0.1
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 8
##   vars  n mean   sd median trimmed  mad min  max range  skew kurtosis  se
## 1    1 24 11.3 1.49   11.8   11.38 1.04 8.9 12.9     4 -0.78    -1.09 0.3
## -------------------------------------------------------- 
## : <= 50mg/l, not noticeable
## : 9
## NULL
## -------------------------------------------------------- 
## : > 50mg/l, noticeable
## : 9
## NULL

Description Two

Free sulfur dioxide and alcohol are among the top three most important features (see the recursive feature elimination in Section I), and the odor threshold makes things interesting. Following this great visualization example, I was able to combine three graphs into one, in a different way from the first plot. The charts on the top and right side show the distributions of quality rating and alcohol level, respectively, by whether the free sulfur dioxide is noticeable or not. As you can see, most wines cluster between 5 and 7 (okay wines), but wines with no noticeable free sulfur dioxide have slightly higher ratings, between 6 and 7, and those with free sulfur dioxide above 50 mg/l have lower alcohol concentration. This pattern holds true no matter what the quality rating is (middle graph). Holding wine quality constant, for example within wines rated 6, the median alcohol level is higher among those with no noticeable sulfur smell (10.6%) than among the others (9.6%). For people who don’t like the sulfur smell and are worried about its health effects, one rule of thumb is perhaps to choose wines with a higher alcohol level.

Plot Three - Free Sulfur Dioxide & Volatile Acidity vs. Alcohol

## group: (7.99,9.24]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 841 41.54 15.77     42   41.37 16.31   5 128   123 0.32     0.94
##     se
## 1 0.54
## -------------------------------------------------------- 
## group: (9.24,10.5]
##   vars    n  mean    sd median trimmed   mad min   max range skew kurtosis
## 1    1 1723 37.55 17.98     36   36.75 19.27   3 138.5 135.5 0.57     0.64
##     se
## 1 0.43
## -------------------------------------------------------- 
## group: (10.5,11.7]
##   vars    n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 1382 31.97 15.26     31   31.06 14.83   2 131   129 0.96     2.67
##     se
## 1 0.41
## -------------------------------------------------------- 
## group: (11.7,13]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 789 30.47 12.71     30      30 11.86   3  96    93 0.77     2.52
##     se
## 1 0.45
## -------------------------------------------------------- 
## group: (13,14.2]
##   vars   n  mean    sd median trimmed   mad min max range skew kurtosis
## 1    1 138 27.88 12.14     28   27.35 13.34   3  65    62  0.4    -0.07
##     se
## 1 1.03
## group: (7.99,9.24]
##   vars   n mean  sd median trimmed  mad min  max range skew kurtosis se
## 1    1 841 0.28 0.1   0.27    0.27 0.07 0.1 0.82  0.71 1.46     3.25  0
## -------------------------------------------------------- 
## group: (9.24,10.5]
##   vars    n mean  sd median trimmed  mad  min max range skew kurtosis se
## 1    1 1723 0.28 0.1   0.26    0.27 0.09 0.08   1  0.92  1.7     5.88  0
## -------------------------------------------------------- 
## group: (10.5,11.7]
##   vars    n mean   sd median trimmed  mad  min  max range skew kurtosis se
## 1    1 1382 0.26 0.09   0.24    0.25 0.07 0.09 0.96  0.88 1.79     6.87  0
## -------------------------------------------------------- 
## group: (11.7,13]
##   vars   n mean  sd median trimmed  mad  min max range skew kurtosis se
## 1    1 789  0.3 0.1   0.29    0.29 0.07 0.08 1.1  1.02 1.45     6.36  0
## -------------------------------------------------------- 
## group: (13,14.2]
##   vars   n mean   sd median trimmed mad  min  max range skew kurtosis   se
## 1    1 138 0.37 0.12   0.35    0.36 0.1 0.15 0.78  0.64 0.87     1.18 0.01

Description Three

Alcohol, free sulfur dioxide and volatile acidity emerged as the main features of interest in Section I, so I think it’s important to understand how they interact with each other. Alcohol is an interesting element because it is not only correlated with wine ratings but can also, in some way, enhance acidity and mask unpleasant odors. The two combined charts below plot alcohol against free sulfur dioxide and volatile acidity. The higher the alcohol level, the lower the free sulfur dioxide. For example, the median free sulfur dioxide in the 13-14.2% alcohol group is only 28 mg/l, much less than the 42 mg/l in the 7.99-9.24% alcohol group. The relationship between volatile acidity and alcohol becomes clearer in the higher alcohol groups. For instance, the median volatile acidity in the 13-14.2% alcohol group is 0.35 g/l, compared with 0.24 g/l in the 10.5-11.7% alcohol group.
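The per-bin medians quoted above can be computed along these lines. This is a hedged sketch, not the report's original code: the values in `wine` are made up, and `alcohol.bin` and `medians` are hypothetical names. The five equal-width bins are an assumption that appears consistent with the group labels in the output above, since `cut()` with `breaks = 5` produces labels of that form.

```r
# Toy data with made-up values spanning an alcohol range of 8 to 14.2.
wine <- data.frame(
  alcohol             = c(8.0, 9.0, 9.5, 10.2, 10.8, 11.5, 12.0, 12.8, 13.2, 14.2),
  free.sulfur.dioxide = c(46, 40, 39, 35, 33, 29, 32, 28, 26, 30),
  volatile.acidity    = c(0.27, 0.29, 0.25, 0.27, 0.22, 0.26, 0.28, 0.30, 0.33, 0.37)
)

# Split alcohol into five equal-width, right-closed bins.
wine$alcohol.bin <- cut(wine$alcohol, breaks = 5)

# Median of both variables within each alcohol bin.
medians <- aggregate(
  cbind(free.sulfur.dioxide, volatile.acidity) ~ alcohol.bin,
  data = wine, FUN = median
)
print(medians)
```

Medians are used rather than means because both variables are right-skewed in the summaries above, so a few extreme wines would otherwise pull the group values upward.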

V - Reflection

My learnings after exploring the white wine dataset:

  1. Alcohol is an important factor in wine taste. At the same time, it interacts with other physicochemical properties. For example, it can suppress unpleasant odors and enhance acidity.

  2. Free sulfur dioxide is really critical. Above 50 mg/l it becomes excessive, and its unpleasant smell becomes noticeable and hurts the taste.

This white wine dataset is the tidiest one I’ve ever used for a Udacity project. However, I was frustrated at the beginning because, except for alcohol, almost none of the input variables has a strong relationship with wine quality. Reading the correlation matrix is not enough. Only when conditioning on other relevant variables did the relationships between the physicochemical properties and quality become clear. Also, all input variables are continuous, which limited the types of graphs I could make. My workaround was to recode some of them as categorical variables.
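To illustrate why a correlation matrix alone can hide a relationship, here is a small sketch with made-up numbers. The toy values are constructed on purpose so that the overall correlation is weak while the within-group correlations are strong; they say nothing about the real dataset. The names `alcohol.level`, `overall` and `within` are hypothetical, and the 10.5% cut point is just a convenient split for the example.

```r
# Toy data with made-up values, built so that the groups differ in level.
wine <- data.frame(
  alcohol          = c(9.0, 9.2, 9.4, 9.6, 12.0, 12.2, 12.4, 12.6),
  volatile.acidity = c(0.20, 0.25, 0.30, 0.35, 0.30, 0.35, 0.40, 0.45),
  quality          = c(6, 6, 5, 5, 8, 7, 7, 6)
)

# Correlation over all wines: what a correlation matrix would report.
overall <- cor(wine$volatile.acidity, wine$quality)

# Recode the continuous alcohol variable into a two-level factor.
wine$alcohol.level <- cut(wine$alcohol,
                          breaks = c(-Inf, 10.5, Inf),
                          labels = c("low", "high"))

# Correlation within each alcohol level, i.e. conditioning on alcohol.
within <- sapply(split(wine, wine$alcohol.level),
                 function(g) cor(g$volatile.acidity, g$quality))

print(overall)
print(within)
```

In this toy example the overall correlation is close to zero, while both within-group correlations are clearly negative. That is the kind of pattern conditioning can reveal.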

The other problem was that my knowledge of the physicochemical properties and how they interact was limited before starting this project. I had to do additional reading to brush up on my wine knowledge.

This dataset is fairly limited, with 13 variables (technically 12 can be used for analysis, because one of them is an ID variable). It would be great if other variables, such as grape type and wine age, could be included for further investigation.